Data Science with

DAY 2 pm



1 CASE STUDY. HIV prevalence in the world - Excel data from Gapminder



The website Gapminder has a large collection of data sets, mostly in Excel format.

We will retrieve the data on adults with HIV (estimated HIV prevalence, in percent, for ages 15-49) from Gapminder. The URL is https://docs.google.com/spreadsheet/pub?key=pyj6tScZqmEfbZyl0qjbiRQ&output=xlsx

The observational units are the countries. The year to which each estimate refers is a fixed variable, and the estimated prevalence is the measured variable.

The function read_excel() cannot download Excel files directly from the web.

We therefore use the function download.file() to save the file into a local directory and then use read_excel() to read it into R.

#required libraries
library(dplyr)
library(tidyr)
library(stringr)
library(readxl)
library(ggplot2)
library(ggrepel)
url <- "https://docs.google.com/spreadsheet/pub?key=pyj6tScZqmEfbZyl0qjbiRQ&output=xlsx"
download.file(url, "DataFiles/HIV.xlsx", mode = "wb") # binary mode keeps the .xlsx file intact on Windows
HIV <- read_excel("DataFiles/HIV.xlsx")
str(HIV)
## Classes 'tbl_df', 'tbl' and 'data.frame':    275 obs. of  34 variables:
##  $ Estimated HIV Prevalence% - (Ages 15-49): chr  "Abkhazia" "Afghanistan" "Akrotiri and Dhekelia" "Albania" ...
##  $ 1979.0                                  : num  NA NA NA NA NA ...
##  $ 1980.0                                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1981.0                                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1982.0                                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1983.0                                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1984.0                                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1985.0                                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1986.0                                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1987.0                                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1988.0                                  : logi  NA NA NA NA NA NA ...
##  $ 1989.0                                  : logi  NA NA NA NA NA NA ...
##  $ 1990.0                                  : num  NA NA NA NA 0.06 NA NA 0.5 NA NA ...
##  $ 1991.0                                  : num  NA NA NA NA 0.06 NA NA 0.8 NA NA ...
##  $ 1992.0                                  : num  NA NA NA NA 0.06 NA NA 1 NA NA ...
##  $ 1993.0                                  : num  NA NA NA NA 0.06 NA NA 1.2 NA NA ...
##  $ 1994.0                                  : num  NA NA NA NA 0.06 NA NA 1.4 NA NA ...
##  $ 1995.0                                  : num  NA NA NA NA 0.06 NA NA 1.6 NA NA ...
##  $ 1996.0                                  : num  NA NA NA NA 0.06 NA NA 1.7 NA NA ...
##  $ 1997.0                                  : num  NA NA NA NA 0.06 NA NA 1.8 NA NA ...
##  $ 1998.0                                  : num  NA NA NA NA 0.06 NA NA 1.8 NA NA ...
##  $ 1999.0                                  : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2000.0                                  : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2001.0                                  : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2002.0                                  : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2003.0                                  : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2004.0                                  : num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ 2005.0                                  : num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ 2006.0                                  : num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ 2007.0                                  : num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ 2008.0                                  : num  NA NA NA NA 0.1 NA NA 2 NA NA ...
##  $ 2009                                    : chr  NA "0.06" NA NA ...
##  $ 2010                                    : chr  NA "0.06" NA NA ...
##  $ 2011                                    : chr  NA "0.06" NA NA ...
head(HIV)
## # A tibble: 6 x 34
##   `Estimated HIV … `1979.0` `1980.0` `1981.0` `1982.0` `1983.0` `1984.0`
##   <chr>               <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 Abkhazia               NA       NA       NA       NA       NA       NA
## 2 Afghanistan            NA       NA       NA       NA       NA       NA
## 3 Akrotiri and Dh…       NA       NA       NA       NA       NA       NA
## 4 Albania                NA       NA       NA       NA       NA       NA
## 5 Algeria                NA       NA       NA       NA       NA       NA
## 6 American Samoa         NA       NA       NA       NA       NA       NA
## # ... with 27 more variables: `1985.0` <dbl>, `1986.0` <dbl>,
## #   `1987.0` <dbl>, `1988.0` <lgl>, `1989.0` <lgl>, `1990.0` <dbl>,
## #   `1991.0` <dbl>, `1992.0` <dbl>, `1993.0` <dbl>, `1994.0` <dbl>,
## #   `1995.0` <dbl>, `1996.0` <dbl>, `1997.0` <dbl>, `1998.0` <dbl>,
## #   `1999.0` <dbl>, `2000.0` <dbl>, `2001.0` <dbl>, `2002.0` <dbl>,
## #   `2003.0` <dbl>, `2004.0` <dbl>, `2005.0` <dbl>, `2006.0` <dbl>,
## #   `2007.0` <dbl>, `2008.0` <dbl>, `2009` <chr>, `2010` <chr>,
## #   `2011` <chr>
names(HIV)
##  [1] "Estimated HIV Prevalence% - (Ages 15-49)"
##  [2] "1979.0"                                  
##  [3] "1980.0"                                  
##  [4] "1981.0"                                  
##  [5] "1982.0"                                  
##  [6] "1983.0"                                  
##  [7] "1984.0"                                  
##  [8] "1985.0"                                  
##  [9] "1986.0"                                  
## [10] "1987.0"                                  
## [11] "1988.0"                                  
## [12] "1989.0"                                  
## [13] "1990.0"                                  
## [14] "1991.0"                                  
## [15] "1992.0"                                  
## [16] "1993.0"                                  
## [17] "1994.0"                                  
## [18] "1995.0"                                  
## [19] "1996.0"                                  
## [20] "1997.0"                                  
## [21] "1998.0"                                  
## [22] "1999.0"                                  
## [23] "2000.0"                                  
## [24] "2001.0"                                  
## [25] "2002.0"                                  
## [26] "2003.0"                                  
## [27] "2004.0"                                  
## [28] "2005.0"                                  
## [29] "2006.0"                                  
## [30] "2007.0"                                  
## [31] "2008.0"                                  
## [32] "2009"                                    
## [33] "2010"                                    
## [34] "2011"

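The name of the first column, which holds the country names, is actually the title of the worksheet. The other columns contain the prevalence of HIV by year, but their names are inconsistent: the years up to 2008 were read as 1979.0, ..., 2008.0, which suggests that they are stored as numbers in the worksheet, while 2009 to 2011 were read as plain text. It is simplest to skip the first row, which contains the column names, and to assign the names in R.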

HIV <- read_excel("DataFiles/HIV.xlsx", skip = 1, col_names = FALSE)
str(HIV)
## Classes 'tbl_df', 'tbl' and 'data.frame':    275 obs. of  34 variables:
##  $ X__1 : chr  "Abkhazia" "Afghanistan" "Akrotiri and Dhekelia" "Albania" ...
##  $ X__2 : num  NA NA NA NA NA ...
##  $ X__3 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X__4 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X__5 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X__6 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X__7 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X__8 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X__9 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X__10: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ X__11: logi  NA NA NA NA NA NA ...
##  $ X__12: logi  NA NA NA NA NA NA ...
##  $ X__13: num  NA NA NA NA 0.06 NA NA 0.5 NA NA ...
##  $ X__14: num  NA NA NA NA 0.06 NA NA 0.8 NA NA ...
##  $ X__15: num  NA NA NA NA 0.06 NA NA 1 NA NA ...
##  $ X__16: num  NA NA NA NA 0.06 NA NA 1.2 NA NA ...
##  $ X__17: num  NA NA NA NA 0.06 NA NA 1.4 NA NA ...
##  $ X__18: num  NA NA NA NA 0.06 NA NA 1.6 NA NA ...
##  $ X__19: num  NA NA NA NA 0.06 NA NA 1.7 NA NA ...
##  $ X__20: num  NA NA NA NA 0.06 NA NA 1.8 NA NA ...
##  $ X__21: num  NA NA NA NA 0.06 NA NA 1.8 NA NA ...
##  $ X__22: num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ X__23: num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ X__24: num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ X__25: num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ X__26: num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ X__27: num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ X__28: num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ X__29: num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ X__30: num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ X__31: num  NA NA NA NA 0.1 NA NA 2 NA NA ...
##  $ X__32: chr  NA "0.06" NA NA ...
##  $ X__33: chr  NA "0.06" NA NA ...
##  $ X__34: chr  NA "0.06" NA NA ...
head(HIV)
## # A tibble: 6 x 34
##   X__1    X__2  X__3  X__4  X__5  X__6  X__7  X__8  X__9 X__10 X__11 X__12
##   <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 Abkha…    NA    NA    NA    NA    NA    NA    NA    NA    NA NA    NA   
## 2 Afgha…    NA    NA    NA    NA    NA    NA    NA    NA    NA NA    NA   
## 3 Akrot…    NA    NA    NA    NA    NA    NA    NA    NA    NA NA    NA   
## 4 Alban…    NA    NA    NA    NA    NA    NA    NA    NA    NA NA    NA   
## 5 Alger…    NA    NA    NA    NA    NA    NA    NA    NA    NA NA    NA   
## 6 Ameri…    NA    NA    NA    NA    NA    NA    NA    NA    NA NA    NA   
## # ... with 22 more variables: X__13 <dbl>, X__14 <dbl>, X__15 <dbl>,
## #   X__16 <dbl>, X__17 <dbl>, X__18 <dbl>, X__19 <dbl>, X__20 <dbl>,
## #   X__21 <dbl>, X__22 <dbl>, X__23 <dbl>, X__24 <dbl>, X__25 <dbl>,
## #   X__26 <dbl>, X__27 <dbl>, X__28 <dbl>, X__29 <dbl>, X__30 <dbl>,
## #   X__31 <dbl>, X__32 <chr>, X__33 <chr>, X__34 <chr>
names(HIV)
##  [1] "X__1"  "X__2"  "X__3"  "X__4"  "X__5"  "X__6"  "X__7"  "X__8" 
##  [9] "X__9"  "X__10" "X__11" "X__12" "X__13" "X__14" "X__15" "X__16"
## [17] "X__17" "X__18" "X__19" "X__20" "X__21" "X__22" "X__23" "X__24"
## [25] "X__25" "X__26" "X__27" "X__28" "X__29" "X__30" "X__31" "X__32"
## [33] "X__33" "X__34"
aux <- seq(1979, 2011, 1)
names(HIV) <- c("Country", as.character(aux))
head(HIV)
## # A tibble: 6 x 34
##   Country `1979` `1980` `1981` `1982` `1983` `1984` `1985` `1986` `1987`
##   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Abkhaz…     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 2 Afghan…     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 3 Akroti…     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 4 Albania     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 5 Algeria     NA     NA     NA     NA     NA     NA     NA     NA     NA
## 6 Americ…     NA     NA     NA     NA     NA     NA     NA     NA     NA
## # ... with 24 more variables: `1988` <lgl>, `1989` <lgl>, `1990` <dbl>,
## #   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>,
## #   `1996` <dbl>, `1997` <dbl>, `1998` <dbl>, `1999` <dbl>, `2000` <dbl>,
## #   `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>,
## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <chr>, `2010` <chr>,
## #   `2011` <chr>
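
As an alternative to typing the year range by hand, the names could be derived from the ones that read_excel() returned. The following is a minimal sketch on a hypothetical vector raw_names that mimics the names shown earlier. It uses base R's sub() and is not part of the original analysis.

```r
# Hypothetical column names mimicking the ones read_excel() returned above
raw_names <- c("Estimated HIV Prevalence% - (Ages 15-49)",
               "1979.0", "1980.0", "2009", "2010")

# Remove a trailing ".0" where present; other names are left untouched
clean_names <- sub("\\.0$", "", raw_names)

# The first column holds the country names, so name it explicitly
clean_names[1] <- "Country"
clean_names
```

With the real data, names(HIV) would take the place of raw_names. This avoids relying on the years forming an unbroken sequence.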

The last three columns (2009 to 2011) have been read as character, presumably because those cells are stored as text in the worksheet. The columns corresponding to 1988 and 1989 are of class logical because all their entries are NAs.

Let us coerce the last three columns to numeric mode.

HIV <- HIV %>%
  mutate_at(32:34, as.numeric)
str(HIV)
## Classes 'tbl_df', 'tbl' and 'data.frame':    275 obs. of  34 variables:
##  $ Country: chr  "Abkhazia" "Afghanistan" "Akrotiri and Dhekelia" "Albania" ...
##  $ 1979   : num  NA NA NA NA NA ...
##  $ 1980   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1981   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1982   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1983   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1984   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1985   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1986   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1987   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ 1988   : logi  NA NA NA NA NA NA ...
##  $ 1989   : logi  NA NA NA NA NA NA ...
##  $ 1990   : num  NA NA NA NA 0.06 NA NA 0.5 NA NA ...
##  $ 1991   : num  NA NA NA NA 0.06 NA NA 0.8 NA NA ...
##  $ 1992   : num  NA NA NA NA 0.06 NA NA 1 NA NA ...
##  $ 1993   : num  NA NA NA NA 0.06 NA NA 1.2 NA NA ...
##  $ 1994   : num  NA NA NA NA 0.06 NA NA 1.4 NA NA ...
##  $ 1995   : num  NA NA NA NA 0.06 NA NA 1.6 NA NA ...
##  $ 1996   : num  NA NA NA NA 0.06 NA NA 1.7 NA NA ...
##  $ 1997   : num  NA NA NA NA 0.06 NA NA 1.8 NA NA ...
##  $ 1998   : num  NA NA NA NA 0.06 NA NA 1.8 NA NA ...
##  $ 1999   : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2000   : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2001   : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2002   : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2003   : num  NA NA NA NA 0.06 NA NA 1.9 NA NA ...
##  $ 2004   : num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ 2005   : num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ 2006   : num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ 2007   : num  NA NA NA NA 0.1 NA NA 1.9 NA NA ...
##  $ 2008   : num  NA NA NA NA 0.1 NA NA 2 NA NA ...
##  $ 2009   : num  NA 0.06 NA NA NA NA NA 2.1 NA NA ...
##  $ 2010   : num  NA 0.06 NA NA NA NA NA 2.1 NA NA ...
##  $ 2011   : num  NA 0.06 NA NA NA NA NA 2.1 NA NA ...
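
as.numeric() turns any string it cannot parse into NA, with only a warning, so it can be worth checking that the coercion did not lose information. Below is a small sketch on a made-up vector. The values, and in particular the "n/a" entry, are assumptions for illustration and are not taken from the Gapminder file.

```r
# Made-up character column: two numbers stored as text, one missing value,
# and one entry that is not a number at all
x <- c(NA, "0.06", "2.1", "n/a")

# Coerce; suppressWarnings() hides the "NAs introduced by coercion" warning
y <- suppressWarnings(as.numeric(x))

# Positions that held text before the coercion but are NA after it
lost <- which(is.na(y) & !is.na(x))
x[lost]
```

If lost is empty for the real columns, the character type only reflected how the cells were stored and no values were discarded.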

The columns before 1990 are mostly NAs and so we will remove them from the data set.

# keep the Country column and columns 13 to 34 (years 1990 to 2011)
HIV <- select(HIV, c(1, 13:34))
HIV
## # A tibble: 275 x 23
##    Country `1990` `1991` `1992` `1993` `1994` `1995` `1996` `1997` `1998`
##    <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1 Abkhaz…  NA     NA     NA     NA     NA     NA     NA     NA     NA   
##  2 Afghan…  NA     NA     NA     NA     NA     NA     NA     NA     NA   
##  3 Akroti…  NA     NA     NA     NA     NA     NA     NA     NA     NA   
##  4 Albania  NA     NA     NA     NA     NA     NA     NA     NA     NA   
##  5 Algeria   0.06   0.06   0.06   0.06   0.06   0.06   0.06   0.06   0.06
##  6 Americ…  NA     NA     NA     NA     NA     NA     NA     NA     NA   
##  7 Andorra  NA     NA     NA     NA     NA     NA     NA     NA     NA   
##  8 Angola    0.5    0.8    1      1.2    1.4    1.6    1.7    1.8    1.8 
##  9 Anguil…  NA     NA     NA     NA     NA     NA     NA     NA     NA   
## 10 Antigu…  NA     NA     NA     NA     NA     NA     NA     NA     NA   
## # ... with 265 more rows, and 13 more variables: `1999` <dbl>,
## #   `2000` <dbl>, `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>,
## #   `2005` <dbl>, `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>,
## #   `2010` <dbl>, `2011` <dbl>

Now, let us tidy the data ready for analysis.

An observational unit is a country and the variables are year and prevalence of HIV. So, the tidy version of the data has three columns: country, year and prevalence.

Let us gather the year columns into a single column named Year and put the corresponding prevalence values in a column named PrevalenceHIV.

HIV2 <- gather(HIV, "Year", "PrevalenceHIV", -Country)
glimpse(HIV2)
## Observations: 6,050
## Variables: 3
## $ Country       <chr> "Abkhazia", "Afghanistan", "Akrotiri and Dhekeli...
## $ Year          <chr> "1990", "1990", "1990", "1990", "1990", "1990", ...
## $ PrevalenceHIV <dbl> NA, NA, NA, NA, 0.06, NA, NA, 0.50, NA, NA, 0.30...

The data is tidy.
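
In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The sketch below does the same reshaping on a two-country toy table (the data frame wide is built by hand, with values copied from the output above) and also converts Year to integer, which the original code leaves as character.

```r
library(tidyr)

# Toy wide table; check.names = FALSE keeps the year names as they are
wide <- data.frame(
  Country = c("Algeria", "Angola"),
  `1990` = c(0.06, 0.5),
  `1991` = c(0.06, 0.8),
  check.names = FALSE
)

# One row per country and year
long <- pivot_longer(wide, cols = -Country,
                     names_to = "Year", values_to = "PrevalenceHIV")

# Store the year as a number rather than as text
long$Year <- as.integer(long$Year)
long
```

Unlike gather(), pivot_longer() keeps the rows of each country together, so the row order differs from that of HIV2 even though the content is the same.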

1.1 Visualizing the data

Let us visualise some of the data.

We will follow the visualisation concepts used by Gapminder.org and bring in additional data from the same site.

The HIV prevalence data will be plotted against income (GDP per capita, PPP$ inflation-adjusted). Gapminder provides the income data in Excel format at the URL https://docs.google.com/spreadsheets/d/1PybxH399kK6OjJI4T2M33UsLqgutwj3SuYbk7Yt6sxE/pub. The data has already been downloaded and is in the file “gdp_per_capita_ppp.xlsx” in the DataFiles folder of the current working directory. Note how we are going back to the beginning of the data analysis process in order to make our data exploration more meaningful.

income <- read_excel("DataFiles/gdp_per_capita_ppp.xlsx")
head(income)
## # A tibble: 6 x 217
##   `GDP per capita` `1800.0` `1801.0` `1802.0` `1803.0` `1804.0` `1805.0`
##   <chr>               <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 Abkhazia               NA       NA       NA       NA       NA       NA
## 2 Afghanistan           603      603      603      603      603      603
## 3 Akrotiri and Dh…       NA       NA       NA       NA       NA       NA
## 4 Albania               667      667      668      668      668      668
## 5 Algeria               716      716      717      718      719      720
## 6 American Samoa         NA       NA       NA       NA       NA       NA
## # ... with 210 more variables: `1806.0` <dbl>, `1807.0` <dbl>,
## #   `1808.0` <dbl>, `1809.0` <dbl>, `1810.0` <dbl>, `1811.0` <dbl>,
## #   `1812.0` <dbl>, `1813.0` <dbl>, `1814.0` <dbl>, `1815.0` <dbl>,
## #   `1816.0` <dbl>, `1817.0` <dbl>, `1818.0` <dbl>, `1819.0` <dbl>,
## #   `1820.0` <dbl>, `1821.0` <dbl>, `1822.0` <dbl>, `1823.0` <dbl>,
## #   `1824.0` <dbl>, `1825.0` <dbl>, `1826.0` <dbl>, `1827.0` <dbl>,
## #   `1828.0` <dbl>, `1829.0` <dbl>, `1830.0` <dbl>, `1831.0` <dbl>,
## #   `1832.0` <dbl>, `1833.0` <dbl>, `1834.0` <dbl>, `1835.0` <dbl>,
## #   `1836.0` <dbl>, `1837.0` <dbl>, `1838.0` <dbl>, `1839.0` <dbl>,
## #   `1840.0` <dbl>, `1841.0` <dbl>, `1842.0` <dbl>, `1843.0` <dbl>,
## #   `1844.0` <dbl>, `1845.0` <dbl>, `1846.0` <dbl>, `1847.0` <dbl>,
## #   `1848.0` <dbl>, `1849.0` <dbl>, `1850.0` <dbl>, `1851.0` <dbl>,
## #   `1852.0` <dbl>, `1853.0` <dbl>, `1854.0` <dbl>, `1855.0` <dbl>,
## #   `1856.0` <dbl>, `1857.0` <dbl>, `1858.0` <dbl>, `1859.0` <dbl>,
## #   `1860.0` <dbl>, `1861.0` <dbl>, `1862.0` <dbl>, `1863.0` <dbl>,
## #   `1864.0` <dbl>, `1865.0` <dbl>, `1866.0` <dbl>, `1867.0` <dbl>,
## #   `1868.0` <dbl>, `1869.0` <dbl>, `1870.0` <dbl>, `1871.0` <dbl>,
## #   `1872.0` <dbl>, `1873.0` <dbl>, `1874.0` <dbl>, `1875.0` <dbl>,
## #   `1876.0` <dbl>, `1877.0` <dbl>, `1878.0` <dbl>, `1879.0` <dbl>,
## #   `1880.0` <dbl>, `1881.0` <dbl>, `1882.0` <dbl>, `1883.0` <dbl>,
## #   `1884.0` <dbl>, `1885.0` <dbl>, `1886.0` <dbl>, `1887.0` <dbl>,
## #   `1888.0` <dbl>, `1889.0` <dbl>, `1890.0` <dbl>, `1891.0` <dbl>,
## #   `1892.0` <dbl>, `1893.0` <dbl>, `1894.0` <dbl>, `1895.0` <dbl>,
## #   `1896.0` <dbl>, `1897.0` <dbl>, `1898.0` <dbl>, `1899.0` <dbl>,
## #   `1900.0` <dbl>, `1901.0` <dbl>, `1902.0` <dbl>, `1903.0` <dbl>,
## #   `1904.0` <dbl>, `1905.0` <dbl>, …

Note how the column with country names has been named GDP per capita and the year column names have a .0 suffix. Let us change the column names and, as before, gather the year columns into a single column, with a second column holding the income for each country and year.

names(income) <- c("Country", as.character(seq(1800, 2015, 1)))
income2 <- gather(income, "Year", "Income", -Country)
glimpse(income2)
## Observations: 56,592
## Variables: 3
## $ Country <chr> "Abkhazia", "Afghanistan", "Akrotiri and Dhekelia", "A...
## $ Year    <chr> "1800", "1800", "1800", "1800", "1800", "1800", "1800"...
## $ Income  <dbl> NA, 603, NA, 667, 716, NA, 1197, 618, NA, 757, 1507, 5...

This looks better!

Do HIV2 and income2 contain the same countries?

nrow(HIV2)
## [1] 6050
nrow(income2)
## [1] 56592
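
Row counts alone do not show which countries differ. A minimal sketch of a direct comparison with base R's set functions follows. The two vectors are hypothetical stand-ins for HIV$Country and income$Country.

```r
# Hypothetical country vectors standing in for HIV$Country and income$Country
hiv_countries <- c("Country A", "Country B", "Country C", "Country D")
income_countries <- c("Country A", "Country B", "Country D", "Country E")

# Countries present in one data set but not in the other
only_in_hiv <- setdiff(hiv_countries, income_countries)
only_in_income <- setdiff(income_countries, hiv_countries)

# Countries present in both
common <- intersect(hiv_countries, income_countries)

only_in_hiv
only_in_income
length(common)
```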

We need to combine the HIV2 and income2 data sets but, from the results above, we observe that they have a different number of rows. Part of the difference comes from the years covered (22 in HIV2, 216 in income2), but the countries differ too: 6050 / 22 = 275 countries in HIV2 against 56592 / 216 = 262 in income2. So we must intersect both data sets, i.e. merge them leaving out the data for countries that are not in both. We use the function inner_join() (package dplyr) which, by default, performs a natural join, using all variables with common names across the two tables (in this case Country and Year).

HIV_Inc <- inner_join(HIV2, income2)#merging HIV2 and income2
## Joining, by = c("Country", "Year")
HIV_Inc
## # A tibble: 5,720 x 4
##    Country               Year  PrevalenceHIV Income
##    <chr>                 <chr>         <dbl>  <dbl>
##  1 Abkhazia              1990          NA        NA
##  2 Afghanistan           1990          NA      1028
##  3 Akrotiri and Dhekelia 1990          NA        NA
##  4 Albania               1990          NA      4350
##  5 Algeria               1990           0.06  10113
##  6 American Samoa        1990          NA        NA
##  7 Andorra               1990          NA     28417
##  8 Angola                1990           0.5    4232
##  9 Anguilla              1990          NA        NA
## 10 Antigua and Barbuda   1990          NA     17154
## # ... with 5,710 more rows

Checking that we got the right number of rows in HIV_Inc:

aux <- intersect(HIV$Country, income$Country) # this vector contains the common countries in HIV and income
length(aux) * 22 # we are considering 22 years per country so this number should be equal to the number of rows in HIV_Inc
## [1] 5720
nrow(HIV_Inc) # the number of rows in HIV_Inc
## [1] 5720
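
Relying on the natural join works here, but naming the key columns makes the intent explicit and protects against a stray column that happens to share a name. A sketch on two hand-made tables (all values invented for illustration):

```r
library(dplyr)

# Toy tables: country "B" has no income data, country "C" has no HIV data
hiv_toy <- data.frame(
  Country = c("A", "A", "B"),
  Year = c("1990", "1991", "1990"),
  PrevalenceHIV = c(0.1, 0.2, 1.5),
  stringsAsFactors = FALSE
)
income_toy <- data.frame(
  Country = c("A", "A", "C"),
  Year = c("1990", "1991", "1990"),
  Income = c(1000, 1100, 5000),
  stringsAsFactors = FALSE
)

# Keep only the country-year pairs present in both tables
joined <- inner_join(hiv_toy, income_toy, by = c("Country", "Year"))
joined
```

Only the two rows for country "A" survive, which mirrors how the real join drops countries missing from either data set.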

To add more interest to our visualisation, we add region (continent, sub-continent) information downloaded from https://www.gapminder.org/data/geo/ into the file “DataGeographiesGapminder.xlsx”. This is a workbook with many sheets. The second sheet contains the list of country names, several regional classifications and other geographical information.

continent <- read_excel("DataFiles/DataGeographiesGapminder.xlsx", sheet = 2)# read only the second sheet
head(continent)
## # A tibble: 6 x 11
##   geo   name  four_regions eight_regions six_regions members_oecd_g77
##   <chr> <chr> <chr>        <chr>         <chr>       <chr>           
## 1 afg   Afgh… asia         asia_west     south_asia  g77             
## 2 alb   Alba… europe       europe_east   europe_cen… others          
## 3 dza   Alge… africa       africa_north  middle_eas… g77             
## 4 and   Ando… europe       europe_west   europe_cen… others          
## 5 ago   Ango… africa       africa_sub_s… sub_sahara… g77             
## 6 atg   Anti… americas     america_north america     g77             
## # ... with 5 more variables: Latitude <dbl>, Longitude <dbl>, `UN member
## #   since` <dttm>, `World bank region` <chr>, `World bank income group
## #   2017` <chr>
continent <- rename(continent, Country = name)
glimpse(continent)
## Observations: 197
## Variables: 11
## $ geo                            <chr> "afg", "alb", "dza", "and", "ag...
## $ Country                        <chr> "Afghanistan", "Albania", "Alge...
## $ four_regions                   <chr> "asia", "europe", "africa", "eu...
## $ eight_regions                  <chr> "asia_west", "europe_east", "af...
## $ six_regions                    <chr> "south_asia", "europe_central_a...
## $ members_oecd_g77               <chr> "g77", "others", "g77", "others...
## $ Latitude                       <dbl> 33.00000, 41.00000, 28.00000, 4...
## $ Longitude                      <dbl> 66.00000, 20.00000, 3.00000, 1....
## $ `UN member since`              <dttm> 1946-11-19, 1955-12-14, 1962-1...
## $ `World bank region`            <chr> "South Asia", "Europe & Central...
## $ `World bank income group 2017` <chr> "Low income", "Upper middle inc...

Next we merge this information with the HIV and income data.

HIV_Inc_Cont <- inner_join(HIV_Inc, continent)
## Joining, by = "Country"
glimpse(HIV_Inc_Cont)
## Observations: 4,312
## Variables: 14
## $ Country                        <chr> "Afghanistan", "Albania", "Alge...
## $ Year                           <chr> "1990", "1990", "1990", "1990",...
## $ PrevalenceHIV                  <dbl> NA, NA, 0.06, NA, 0.50, NA, 0.3...
## $ Income                         <dbl> 1028, 4350, 10113, 28417, 4232,...
## $ geo                            <chr> "afg", "alb", "dza", "and", "ag...
## $ four_regions                   <chr> "asia", "europe", "africa", "eu...
## $ eight_regions                  <chr> "asia_west", "europe_east", "af...
## $ six_regions                    <chr> "south_asia", "europe_central_a...
## $ members_oecd_g77               <chr> "g77", "others", "g77", "others...
## $ Latitude                       <dbl> 33.00000, 41.00000, 28.00000, 4...
## $ Longitude                      <dbl> 66.00000, 20.00000, 3.00000, 1....
## $ `UN member since`              <dttm> 1946-11-19, 1955-12-14, 1962-1...
## $ `World bank region`            <chr> "South Asia", "Europe & Central...
## $ `World bank income group 2017` <chr> "Low income", "Upper middle inc...
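
The row counts above imply that this join kept 4312 / 22 = 196 countries out of the 5720 / 22 = 260 in HIV_Inc. One way to list the names that failed to match is anti_join() from dplyr. The sketch below uses hypothetical tables, and the spelling mismatch in it is invented for illustration rather than observed in the Gapminder files.

```r
library(dplyr)

# Hypothetical data table and region table with one spelling mismatch
data_toy <- data.frame(
  Country = c("Angola", "Ivory Coast", "Zambia"),
  Income = c(4232, 3000, 2407),
  stringsAsFactors = FALSE
)
geo_toy <- data.frame(
  Country = c("Angola", "Cote d'Ivoire", "Zambia"),
  four_regions = "africa",
  stringsAsFactors = FALSE
)

# Rows of data_toy that have no partner in geo_toy
dropped <- anti_join(data_toy, geo_toy, by = "Country")
dropped
```

Applied to the real tables, anti_join(HIV_Inc, continent, by = "Country") would return the rows that inner_join() discarded, which helps to tell territories absent from the region list apart from mere spelling differences.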

The plan is to plot prevalence vs. income, distinguishing continents by colour, in five side-by-side panels, one for each of the years 1990, 1995, 2000, 2005 and 2011. The filter() function of the dplyr package subsets the data according to a logical condition.

aux <- filter(HIV_Inc_Cont, Year %in% c("1990", "1995", "2000", "2005", "2011"))
aux %>% 
  ggplot(aes(x = Income, y = PrevalenceHIV, col = four_regions) ) + 
  geom_point(alpha=0.8) + 
  labs(x = "GDP per capita ($) - inflation adjusted" ) +
  labs(y = "Estimated HIV prevalence (%)" ) +
  ggtitle("Plot of HIV prevalence vs income - all nations") +
  facet_grid(.~Year) + # one plot for each of the desired years
  theme(legend.position = "bottom")
## Warning: Removed 249 rows containing missing values (geom_point).

As the plots above and below show, most African countries have prevalence values on a scale about ten times that of the rest of the world. This makes the combined visualisation hard to read, so we will visualise the data for African countries separately.

ggplot(HIV_Inc_Cont, aes(x = four_regions, y = PrevalenceHIV)) +
  geom_boxplot()
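
Another option, not used in these notes, is to keep all regions in one figure but give each region its own vertical scale. A sketch with facet_grid() and scales = "free_y" on a small invented data frame (the numbers are placeholders, not Gapminder values):

```r
library(ggplot2)

# Invented data: two regions with very different prevalence levels
toy <- data.frame(
  Income = c(1000, 5000, 20000, 30000, 1200, 6000, 25000, 32000),
  PrevalenceHIV = c(12, 20, 0.2, 0.4, 15, 24, 0.3, 0.5),
  four_regions = rep(c("africa", "africa", "europe", "europe"), 2),
  Year = rep(c("1990", "2011"), each = 4)
)

# One row of panels per region, one column per year;
# each row of panels gets its own y axis
p_free <- ggplot(toy, aes(x = Income, y = PrevalenceHIV)) +
  geom_point(alpha = 0.8) +
  facet_grid(four_regions ~ Year, scales = "free_y") +
  labs(x = "GDP per capita ($) - inflation adjusted",
       y = "Estimated HIV prevalence (%)")

p_free
```

The trade-off is that panels in different rows can no longer be compared by eye, since their axes differ.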

#Only Africa
aux2 <- filter(aux, four_regions == "africa")# we further filter the data to select only countries in Africa
p_africa <- ggplot(aux2, aes(x = Income, y = PrevalenceHIV) ) + 
          geom_point(alpha = 0.8, color = "green", show.legend = FALSE) +
          labs(x = "GDP per capita ($) - inflation adjusted" ) +
          labs(y = "Estimated HIV prevalence (%)" ) +
          ggtitle("Plot of HIV prevalence vs income - Africa") +
          facet_grid(.~Year) 
    
p_africa


To gain more insight, let us identify the African countries with HIV prevalence greater than or equal to 10%.

#for year 1990
x_90 <- filter(aux2, PrevalenceHIV >= 10 & Year == "1990") #further filter the data selecting prevalence>=10 and year 1990
select(x_90, 1:5)
## # A tibble: 3 x 5
##   Country  Year  PrevalenceHIV Income geo  
##   <chr>    <chr>         <dbl>  <dbl> <chr>
## 1 Uganda   1990           10.2    767 uga  
## 2 Zambia   1990           12.7   2407 zmb  
## 3 Zimbabwe 1990           10.1   2532 zwe
#for year 1995
x_95 <- filter(aux2, PrevalenceHIV >= 10 & Year == "1995")
select(x_95, 1:5)
## # A tibble: 7 x 5
##   Country   Year  PrevalenceHIV Income geo  
##   <chr>     <chr>         <dbl>  <dbl> <chr>
## 1 Botswana  1995           16.6   8823 bwa  
## 2 Kenya     1995           10.3   2199 ken  
## 3 Lesotho   1995           14.3   1466 lso  
## 4 Malawi    1995           13.9    593 mwi  
## 5 Swaziland 1995           10.6   5043 swz  
## 6 Zambia    1995           15     2106 zmb  
## 7 Zimbabwe  1995           25.1   2416 zwe
#for year 2000
x_00 <- filter(aux2, PrevalenceHIV >= 10 & Year == "2000")
select(x_00, 1:5)
## # A tibble: 8 x 5
##   Country      Year  PrevalenceHIV Income geo  
##   <chr>        <chr>         <dbl>  <dbl> <chr>
## 1 Botswana     2000           26    10250 bwa  
## 2 Lesotho      2000           24.5   1629 lso  
## 3 Malawi       2000           14.2    632 mwi  
## 4 Namibia      2000           15.3   6111 nam  
## 5 South Africa 2000           16.1   9927 zaf  
## 6 Swaziland    2000           22.3   5257 swz  
## 7 Zambia       2000           14.4   2202 zmb  
## 8 Zimbabwe     2000           24.8   2521 zwe
#for year 2005
x_05 <- filter(aux2, PrevalenceHIV >= 10 & Year == "2005")
select(x_05, 1:5) 
## # A tibble: 9 x 5
##   Country      Year  PrevalenceHIV Income geo  
##   <chr>        <chr>         <dbl>  <dbl> <chr>
## 1 Botswana     2005           25.5  11460 bwa  
## 2 Lesotho      2005           23.6   1810 lso  
## 3 Malawi       2005           12.1    609 mwi  
## 4 Mozambique   2005           11.2    774 moz  
## 5 Namibia      2005           15.7   7279 nam  
## 6 South Africa 2005           18.1  11133 zaf  
## 7 Swaziland    2005           25.6   5618 swz  
## 8 Zambia       2005           13.9   2620 zmb  
## 9 Zimbabwe     2005           18.4   1689 zwe
#for year 2011
x_11 <- filter(aux2, PrevalenceHIV >= 10 & Year == "2011")
select(x_11, 1:5)
## # A tibble: 9 x 5
##   Country      Year  PrevalenceHIV Income geo  
##   <chr>        <chr>         <dbl>  <dbl> <chr>
## 1 Botswana     2011           23.4  14341 bwa  
## 2 Lesotho      2011           23.3   2301 lso  
## 3 Malawi       2011           10      747 mwi  
## 4 Mozambique   2011           11.3    974 moz  
## 5 Namibia      2011           13.4   8715 nam  
## 6 South Africa 2011           17.3  12291 zaf  
## 7 Swaziland    2011           26     5846 swz  
## 8 Zambia       2011           12.5   3557 zmb  
## 9 Zimbabwe     2011           14.9   1626 zwe
library(ggrepel)

To add country name labels to the plots we use the function geom_text_repel() in the package ggrepel. The most important feature of ggrepel is that it keeps labels from overlapping when the points they identify are close together.

#Let us add the names of the countries with high HIV prevalence to the plots.

p_africa <- p_africa + 
             geom_text_repel(data = x_90, aes(label = geo) , col = "black", size = 3) +
             geom_text_repel(data = x_95, aes(label = geo) , col = "black", size = 3) +
             geom_text_repel(data = x_00, aes(label = geo) , col = "black", size = 3) +
             geom_text_repel(data = x_05, aes(label = geo) , col = "black", size = 3) +
             geom_text_repel(data = x_11, aes(label = geo) , col = "black", size = 3) 
             
p_africa
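
Because aux2 already contains only the five selected years, the five filter() calls and five label layers above could be collapsed into one of each. A sketch of the filtering step follows, on a toy table whose values are copied from the outputs above (the table africa_toy itself is built by hand for illustration).

```r
library(dplyr)

# Toy stand-in for aux2, restricted to a few rows
africa_toy <- data.frame(
  Country = c("Uganda", "Zambia", "Gabon", "Zambia", "Gabon"),
  geo = c("uga", "zmb", "gab", "zmb", "gab"),
  Year = c("1990", "1990", "1990", "2011", "2011"),
  PrevalenceHIV = c(10.2, 12.7, 0.9, 12.5, 5.0),
  stringsAsFactors = FALSE
)

# One filter for all years at once
high_prev <- africa_toy %>%
  filter(PrevalenceHIV >= 10) %>%
  arrange(Year, Country)

# Country codes above the threshold, listed by year
split(high_prev$geo, high_prev$Year)
```

A single layer such as geom_text_repel(data = high_prev, aes(label = geo)) should then label every panel, because the Year column in high_prev tells facet_grid() which panel each label belongs to.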

Note that countries with missing data are not in the plot.


Which countries are getting richer? Is that reflected in the HIV prevalence?

#for year 1990
x_90 <- filter(aux2, Income >= 15000 & Year == "1990") # select african countries data for year 1990 and income>=15000
x_90
## # A tibble: 2 x 14
##   Country Year  PrevalenceHIV Income geo   four_regions eight_regions
##   <chr>   <chr>         <dbl>  <dbl> <chr> <chr>        <chr>        
## 1 Gabon   1990            0.9  19358 gab   africa       africa_sub_s…
## 2 Libya   1990           NA    26928 lby   africa       africa_north 
## # ... with 7 more variables: six_regions <chr>, members_oecd_g77 <chr>,
## #   Latitude <dbl>, Longitude <dbl>, `UN member since` <dttm>, `World bank
## #   region` <chr>, `World bank income group 2017` <chr>
#for year 1995
x_95 <- filter(aux2, Income >= 15000 & Year == "1995")
select(x_95, 1:5)
## # A tibble: 3 x 5
##   Country    Year  PrevalenceHIV Income geo  
##   <chr>      <chr>         <dbl>  <dbl> <chr>
## 1 Gabon      1995            3.1  19738 gab  
## 2 Libya      1995           NA    23363 lby  
## 3 Seychelles 1995           NA    15097 syc
#for year 2000
x_00 <- filter(aux2, Income >= 15000 & Year == "2000")
select(x_00, 1:5)
## # A tibble: 3 x 5
##   Country    Year  PrevalenceHIV Income geo  
##   <chr>      <chr>         <dbl>  <dbl> <chr>
## 1 Gabon      2000            5.2  17630 gab  
## 2 Libya      2000           NA    22682 lby  
## 3 Seychelles 2000           NA    18453 syc
#for year 2005
x_05 <- filter(aux2, Income >= 15000 & Year == "2005")
select(x_05, 1:5)
## # A tibble: 4 x 5
##   Country           Year  PrevalenceHIV Income geo  
##   <chr>             <chr>         <dbl>  <dbl> <chr>
## 1 Equatorial Guinea 2005            3.6  36200 gnq  
## 2 Gabon             2005            5.4  17069 gab  
## 3 Libya             2005           NA    26967 lby  
## 4 Seychelles        2005           NA    17803 syc
#for year 2011
x_11 <- filter(aux2, Income >= 15000 & Year == "2011")
select(x_11, 1:5)
## # A tibble: 4 x 5
##   Country           Year  PrevalenceHIV Income geo  
##   <chr>             <chr>         <dbl>  <dbl> <chr>
## 1 Equatorial Guinea 2011            4.7  35150 gnq  
## 2 Gabon             2011            5    16590 gab  
## 3 Mauritius         2011            1    16179 mus  
## 4 Seychelles        2011           NA    22556 syc
#Let us add the names of the countries with high income to the plots.

p_africa <- p_africa + 
             geom_text_repel(data = x_90, aes(label = geo) , col = "black", size = 3) +
             geom_text_repel(data = x_95, aes(label = geo) , col = "black", size = 3) +
             geom_text_repel(data = x_00, aes(label = geo) , col = "black", size = 3) +
             geom_text_repel(data = x_05, aes(label = geo) , col = "black", size = 3) +
             geom_text_repel(data = x_11, aes(label = geo) , col = "black", size = 3) 
             
p_africa
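
The five filters and five label layers above can also be written more compactly. The following is a minimal sketch, assuming that p_africa is facetted by Year in the same way as p_rest; toy_africa is a hypothetical stand-in for aux2, built from a few rows of the printed output plus one invented low-income row ("Exampleland"). Because the label data keeps the Year column, a single geom_text_repel() layer puts each label in the matching facet.

```r
library(dplyr)
library(ggplot2)
library(ggrepel)

# Hypothetical stand-in for aux2: four rows copied from the tables above,
# plus one invented low-income row that the filter should drop.
toy_africa <- tibble(
  Country       = c("Gabon", "Libya", "Gabon", "Seychelles", "Exampleland"),
  Year          = c("1990", "1990", "1995", "1995", "1990"),
  PrevalenceHIV = c(0.9, NA, 3.1, NA, 2.0),
  Income        = c(19358, 26928, 19738, 15097, 1200),
  geo           = c("gab", "lby", "gab", "syc", "exl")
)

# One filter for all the years of interest instead of one filter per year.
years <- c("1990", "1995", "2000", "2005", "2011")
high_income <- filter(toy_africa, Income >= 15000, Year %in% years)

# One label layer: facet_grid() splits the label data by Year as well.
# Rows with a missing prevalence cannot be drawn and are dropped (na.rm = TRUE).
p_toy <- ggplot(toy_africa, aes(x = Income, y = PrevalenceHIV)) +
  geom_point(alpha = 0.8, na.rm = TRUE) +
  facet_grid(. ~ Year) +
  geom_text_repel(data = high_income, aes(label = geo),
                  col = "black", size = 3, na.rm = TRUE)

high_income
```

With the real data the same pattern would be filter(aux2, Income >= 15000, Year %in% years), followed by one geom_text_repel(data = ..., aes(label = geo)) layer added to p_africa.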




library(plotly)
ggplotly(p_africa)
## Warning in geom2trace.default(dots[[1L]][[1L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomTextRepel() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues


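The warning above says that plotly cannot convert the geom_text_repel() layers, so the labels are missing from the interactive plot. One possible workaround, sketched here under assumptions, is to build the interactive version from a plot that uses geom_text() instead, which ggplotly() can convert. The data frame toy_labels is a hypothetical stand-in with three rows copied from the tables above; the labels are not repelled, so they may overlap the points.

```r
library(ggplot2)
library(plotly)

# Hypothetical stand-in data: three rows copied from the tables above.
toy_labels <- data.frame(
  Income        = c(19358, 36200, 16179),
  PrevalenceHIV = c(0.9, 3.6, 1),
  geo           = c("gab", "gnq", "mus")
)

# Static plot without any ggrepel layer.
p_static <- ggplot(toy_labels, aes(x = Income, y = PrevalenceHIV)) +
  geom_point(alpha = 0.8)

# Plain text labels for the interactive version.
p_interactive <- p_static + geom_text(aes(label = geo), size = 3)

w <- ggplotly(p_interactive)
w
```

The static plot can keep its geom_text_repel() layers; only the version passed to ggplotly() needs the plain geom_text() labels.
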
We now carry out a similar analysis for the rest of the continents.

aux2_r <- filter(aux, four_regions != "africa")
p_rest <- ggplot(aux2_r, aes(x = Income, y = PrevalenceHIV, col = four_regions) ) + 
          geom_point(alpha=0.8) +
          labs(x = "GDP per capita ($) - inflation adjusted" ) +
          labs(y = "Estimated HIV prevalence (%)" ) +
          ggtitle("Plot of HIV prevalence vs income - Americas, Asia and Europe") +
          facet_grid(.~Year) +
          theme(legend.position = "bottom")
p_rest


Let us identify the countries with HIV prevalence greater than or equal to 1%.

#for year 1990
x_90 <- filter(aux2_r, PrevalenceHIV >= 1, Year == "1990")
x_90
## # A tibble: 6 x 14
##   Country Year  PrevalenceHIV Income geo   four_regions eight_regions
##   <chr>   <chr>         <dbl>  <dbl> <chr> <chr>        <chr>        
## 1 Bahamas 1990            3.6  24281 bhs   americas     america_north
## 2 Guyana  1990            2.8   3231 guy   americas     america_south
## 3 Haiti   1990            1.3   2242 hti   americas     america_north
## 4 Hondur… 1990            1.1   3205 hnd   americas     america_north
## 5 Jamaica 1990            2.1   7391 jam   americas     america_north
## 6 Thaila… 1990            1     6369 tha   asia         east_asia_pa…
## # ... with 7 more variables: six_regions <chr>, members_oecd_g77 <chr>,
## #   Latitude <dbl>, Longitude <dbl>, `UN member since` <dttm>, `World bank
## #   region` <chr>, `World bank income group 2017` <chr>
#for year 1995
x_95 <- filter(aux2_r, PrevalenceHIV >= 1 & Year == "1995")
select(x_95, 1:7)
## # A tibble: 9 x 7
##   Country  Year  PrevalenceHIV Income geo   four_regions eight_regions    
##   <chr>    <chr>         <dbl>  <dbl> <chr> <chr>        <chr>            
## 1 Bahamas  1995            3.7  22119 bhs   americas     america_north    
## 2 Belize   1995            1.5   6209 blz   americas     america_north    
## 3 Cambodia 1995            1.4   1091 khm   asia         east_asia_pacific
## 4 Guyana   1995            2.2   4533 guy   americas     america_south    
## 5 Haiti    1995            3.6   1672 hti   americas     america_north    
## 6 Honduras 1995            1.5   3344 hnd   americas     america_north    
## 7 Jamaica  1995            2.2   8644 jam   americas     america_north    
## 8 Panama   1995            1.6   8795 pan   americas     america_north    
## 9 Thailand 1995            2.1   9239 tha   asia         east_asia_pacific
#for year 2000
x_00 <- filter(aux2_r, PrevalenceHIV >= 1 & Year == "2000")
select(x_00, 1:7)
## # A tibble: 12 x 7
##    Country    Year  PrevalenceHIV Income geo   four_regions eight_regions 
##    <chr>      <chr>         <dbl>  <dbl> <chr> <chr>        <chr>         
##  1 Bahamas    2000            3.2  25858 bhs   americas     america_north 
##  2 Belize     2000            2.2   7215 blz   americas     america_north 
##  3 Cambodia   2000            1.3   1368 khm   asia         east_asia_pac…
##  4 Dominican… 2000            1     7955 dom   americas     america_north 
##  5 Guyana     2000            1.5   5071 guy   americas     america_south 
##  6 Haiti      2000            2.8   1734 hti   americas     america_north 
##  7 Honduras   2000            1.3   3483 hnd   americas     america_north 
##  8 Jamaica    2000            1.9   8139 jam   americas     america_north 
##  9 Panama     2000            1.4   9954 pan   americas     america_north 
## 10 Suriname   2000            1     9908 sur   americas     america_south 
## 11 Thailand   2000            1.8   8939 tha   asia         east_asia_pac…
## 12 Trinidad … 2000            1.2  17721 tto   americas     america_north
#for year 2005
x_05 <- filter(aux2_r, PrevalenceHIV >= 1 & Year == "2005")
select(x_05, 1:7)
## # A tibble: 11 x 7
##    Country    Year  PrevalenceHIV Income geo   four_regions eight_regions 
##    <chr>      <chr>         <dbl>  <dbl> <chr> <chr>        <chr>         
##  1 Bahamas    2005            3    25397 bhs   americas     america_north 
##  2 Belize     2005            2.4   8202 blz   americas     america_north 
##  3 Estonia    2005            1.1  21651 est   europe       europe_east   
##  4 Guyana     2005            1.1   5140 guy   americas     america_south 
##  5 Haiti      2005            2.1   1562 hti   americas     america_north 
##  6 Jamaica    2005            1.8   8803 jam   americas     america_north 
##  7 Panama     2005            1.1  11156 pan   americas     america_north 
##  8 Suriname   2005            1.1  12225 sur   americas     america_south 
##  9 Thailand   2005            1.5  10901 tha   asia         east_asia_pac…
## 10 Trinidad … 2005            1.3  25439 tto   americas     america_north 
## 11 Ukraine    2005            1.1   7265 ukr   europe       europe_east
#for year 2011
x_11 <- filter(aux2_r, PrevalenceHIV >= 1 & Year == "2011")
select(x_11, 1:7)
p_rest <- p_rest + 
            geom_text_repel(data = x_90, aes(label = geo, col = four_regions) , size = 3) +
            geom_text_repel(data = x_95, aes(label = geo, col = four_regions) , size = 3) +
            geom_text_repel(data = x_00, aes(label = geo, col = four_regions) , size = 3) +
            geom_text_repel(data = x_05, aes(label = geo, col = four_regions) , size = 3) +
            geom_text_repel(data = x_11, aes(label = geo, col = four_regions) , size = 3) 
            
p_rest


Which countries are getting richer? Is that reflected in the HIV prevalence?

#for year 1990
x_90 <- filter(aux2_r, Income >= 50000 & Year == "1990")
select(x_90, 1:7)
## # A tibble: 4 x 7
##   Country     Year  PrevalenceHIV Income geo   four_regions eight_regions 
##   <chr>       <chr>         <dbl>  <dbl> <chr> <chr>        <chr>         
## 1 Brunei      1990          NA     77076 brn   asia         east_asia_pac…
## 2 Luxembourg  1990           0.1   56922 lux   europe       europe_west   
## 3 Qatar       1990           0.06  73402 qat   asia         asia_west     
## 4 United Ara… 1990          NA    114832 are   asia         asia_west
#for year 1995
x_95 <- filter(aux2_r, Income >= 50000 & Year == "1995")
select(x_95, 1:7)
## # A tibble: 6 x 7
##   Country     Year  PrevalenceHIV Income geo   four_regions eight_regions 
##   <chr>       <chr>         <dbl>  <dbl> <chr> <chr>        <chr>         
## 1 Brunei      1995          NA     78406 brn   asia         east_asia_pac…
## 2 Kuwait      1995          NA     82268 kwt   asia         asia_west     
## 3 Luxembourg  1995           0.2   64568 lux   europe       europe_west   
## 4 Norway      1995           0.1   50616 nor   europe       europe_west   
## 5 Qatar       1995           0.06  77809 qat   asia         asia_west     
## 6 United Ara… 1995          NA    106425 are   asia         asia_west
#for year 2000
x_00 <- filter(aux2_r, Income >= 50000 & Year == "2000")
select(x_00, 1:7)
## # A tibble: 9 x 7
##   Country     Year  PrevalenceHIV Income geo   four_regions eight_regions 
##   <chr>       <chr>         <dbl>  <dbl> <chr> <chr>        <chr>         
## 1 Brunei      2000          NA     74475 brn   asia         east_asia_pac…
## 2 Kuwait      2000          NA     75219 kwt   asia         asia_west     
## 3 Luxembourg  2000           0.2   81425 lux   europe       europe_west   
## 4 Monaco      2000          NA     50200 mco   europe       europe_west   
## 5 Norway      2000           0.1   58699 nor   europe       europe_west   
## 6 Qatar       2000           0.06 112238 qat   asia         asia_west     
## 7 San Marino  2000          NA     51350 smr   europe       europe_west   
## 8 Singapore   2000           0.1   51663 sgp   asia         east_asia_pac…
## 9 United Ara… 2000          NA    108048 are   asia         asia_west
#for year 2005
x_05 <- filter(aux2_r, Income >= 50000 & Year == "2005")
select(x_05, 1:7)
## # A tibble: 10 x 7
##    Country     Year  PrevalenceHIV Income geo   four_regions eight_regions
##    <chr>       <chr>         <dbl>  <dbl> <chr> <chr>        <chr>        
##  1 Brunei      2005          NA     74441 brn   asia         east_asia_pa…
##  2 Kuwait      2005          NA     92665 kwt   asia         asia_west    
##  3 Luxembourg  2005           0.3   88944 lux   europe       europe_west  
##  4 Monaco      2005          NA     52761 mco   europe       europe_west  
##  5 Norway      2005           0.1   63573 nor   europe       europe_west  
##  6 Qatar       2005           0.06 119134 qat   asia         asia_west    
##  7 San Marino  2005          NA     53928 smr   europe       europe_west  
##  8 Singapore   2005           0.1   61921 sgp   asia         east_asia_pa…
##  9 Switzerland 2005           0.3   51069 che   europe       europe_west  
## 10 United Ara… 2005          NA    102324 are   asia         asia_west
#for year 2011
x_11 <- filter(aux2_r, Income >= 50000 & Year == "2011")
select(x_11, 1:7)
## # A tibble: 10 x 7
##    Country     Year  PrevalenceHIV Income geo   four_regions eight_regions
##    <chr>       <chr>         <dbl>  <dbl> <chr> <chr>        <chr>        
##  1 Brunei      2011           NA    71991 brn   asia         east_asia_pa…
##  2 Hong Kong,… 2011           NA    50086 hkg   asia         east_asia_pa…
##  3 Kuwait      2011           NA    79102 kwt   asia         asia_west    
##  4 Luxembourg  2011            0.3  91469 lux   europe       europe_west  
##  5 Monaco      2011           NA    58081 mco   europe       europe_west  
##  6 Norway      2011            0.1  62737 nor   europe       europe_west  
##  7 Qatar       2011           NA   133734 qat   asia         asia_west    
##  8 Singapore   2011            0.1  74949 sgp   asia         east_asia_pa…
##  9 Switzerland 2011            0.4  54551 che   europe       europe_west  
## 10 United Ara… 2011           NA    56192 are   asia         asia_west
p_rest <- p_rest + 
            geom_text_repel(data = x_90, aes(label = geo, col = four_regions) , size = 3) +
            geom_text_repel(data = x_95, aes(label = geo, col = four_regions) , size = 3) +
            geom_text_repel(data = x_00, aes(label = geo, col = four_regions) , size = 3) +
            geom_text_repel(data = x_05, aes(label = geo, col = four_regions) , size = 3) +
            geom_text_repel(data = x_11, aes(label = geo, col = four_regions) , size = 3) 
          
p_rest
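
The tables above list the countries whose income exceeds a threshold in each year; they do not show directly which countries became richer. One possible way to look at that, sketched here under assumptions, is to compute the change in income and in prevalence between the first and the last year. The tibble toy_rest is a hypothetical stand-in for aux2_r, using values copied from the 1990 and 2011 tables above.

```r
library(dplyr)

# Hypothetical stand-in for aux2_r: values copied from the 1990 and 2011 tables.
toy_rest <- tibble(
  Country       = rep(c("Brunei", "Luxembourg", "United Arab Emirates"), each = 2),
  Year          = rep(c("1990", "2011"), times = 3),
  Income        = c(77076, 71991, 56922, 91469, 114832, 56192),
  PrevalenceHIV = c(NA, NA, 0.1, 0.3, NA, NA)
)

# Change between 1990 and 2011 for each country, largest income gain first.
# A missing prevalence in either year gives NA for the prevalence change.
change <- toy_rest %>%
  group_by(Country) %>%
  summarise(
    IncomeChange     = Income[Year == "2011"] - Income[Year == "1990"],
    PrevalenceChange = PrevalenceHIV[Year == "2011"] - PrevalenceHIV[Year == "1990"]
  ) %>%
  arrange(desc(IncomeChange))

change
```

In this toy subset only Luxembourg has a higher income in 2011 than in 1990, and it is also the only country with a prevalence estimate in both years, so no general conclusion should be drawn from it.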


EXERCISE: Visualise the HIV prevalence data for the Americas, distinguishing between its sub-regions (for example, the america_north and america_south values of eight_regions).